{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Network Analysis--Using Null Models"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Adapted from Professor Clauset's lectures and homeworks for Network Analysis and Modeling // Course page: http://tuvalu.santafe.edu/~aaronc/courses/5352/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#relatively fast networks package (pip install python-igraph) that I used for these homeworks\n",
    "import igraph \n",
    "# slow-and-steady networks package. fewer bugs, easier drawing\n",
    "import networkx as nx\n",
    "# plots!\n",
    "import matplotlib.pyplot as plt\n",
    "from matplotlib import style\n",
    "%matplotlib inline\n",
    "\n",
    "# other packages\n",
    "from __future__ import division\n",
    "from random import random, shuffle\n",
    "from numpy import percentile\n",
    "from operator import itemgetter\n",
    "from tabulate import tabulate\n",
    "from collections import Counter"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Graphs!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A good example of a real-world graph (because it happens to be one). For now it's just important to know that this is a graph of social interactions between 34 individuals involved in the same karate club. Drawing it less because it's informative, and more because plotting is fun."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "real_graph = nx.karate_club_graph()\n",
    "positions = nx.spring_layout(real_graph)\n",
    "nx.draw(real_graph, node_color = 'blue', pos = positions)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now. What's the difference between that (^) drawing of nodes and edges and a completely random assembly of dots and lines? How can we quantify the difference between a social network, which we think probably has important structure, and a completely random network, whose structure contains very little useful information? Which aspects of a network can be explained by simple statistics like average degree, the number of nodes, or the degree distribution? Which characteristics of a network depend on a structure or generative process that could reveal an underlying truth about the way the network came about?\n",
    "\n",
    "The question to ask is: how likely is a specific network characteristic to have been generated by a random process?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Random Graph Models"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The Erdös-Rényi Random Graph\n",
    "\n",
    "The simplest random graph you can think of. For a graph $G$ with $n$ nodes, each pair of nodes gets an (undirected) edge with probability $p$. There are ${n \\choose 2}$ pairs of nodes, so ${n \\choose 2}$ possible edges. Then the average degree of a node in this random graph is $(n-1)p$, where $(n-1)$ is the number of possible connections for a node $i$ and $p$ is the probability of that connection existing. Call the expected average degree $\\bar k = (n-1)p$."
   ]
  },
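  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check of the definition, we can sample a $G(n,p)$ graph with networkx's built-in generator and compare the empirical mean degree to $(n-1)p$. A minimal sketch; the values of $n$ and $p$ here are arbitrary."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import networkx as nx\n",
    "\n",
    "n, p = 200, 0.05  # arbitrary example parameters\n",
    "G_er = nx.gnp_random_graph(n, p, seed=1)\n",
    "# each edge contributes 2 to the total degree\n",
    "empirical_k = 2.0 * G_er.number_of_edges() / n\n",
    "print(\"empirical: {}  expected: {}\".format(empirical_k, (n - 1) * p))"
   ]
  },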
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Giant Components\n",
    "\n",
    "One property that we see all the time in social graphs (and many other graphs) is the emergence of a \"giant\" connected component. The Erdös-Rényi also develops a giant component for certain parameter spaces. In fact, when the average degree is more than 1 we see a giant component emerging, and when it is more that 3 that giant component is all or almost all of the graph. That means that for a random graph with $p > \\frac{1}{n-1}$ we will always start to see a giant component. \n",
    "\n",
    "To demonstrate why this is true, consider $u$ to be the fraction of vertices not in the giant component. Then where $u$ is also the probability that a randomly chosen vertex $i$ does not belong to the giant component of the graph. For $i$ to not be a part of the giant component, for every other vertex $j$ ($n-1$ vertices), $i$ is either not connected to $j$ (with probability $1-p$), or $j$ is not connected to the giant component (with probability $pu$). Then:\n",
    "$$ u = ((1-p) + (pu))^{n-1} $$\n",
    "We can use $ p = \\frac{\\bar k}{n-1} $ to rewrite the expression as:\n",
    "$$ u = (1 - \\frac{\\bar k(1-u)}{n-1})^{n-1} $$\n",
    "And then taking the limit for large $n$ and using the fact that $\\lim_{x\\rightarrow\\infty}(1-\\frac{x}{n})^n = e^{-x}$:\n",
    "$$ u = e^{-k(1-u)} $$\n",
    "Now if $u$ is the fraction of vertices not in the giant component, call $S = 1-u$ the fraction of vertices in the giant component. Then:\n",
    "$$ S = e^{-\\bar kS} $$\n",
    "\n",
    "There is no closed-form solution to this equation, but below we can show a simulation of random graphs and the size of the largest connected component in each one."
   ]
  },
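  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Although $S = 1 - e^{-\\bar kS}$ has no closed form, it can be solved numerically by fixed-point iteration. A minimal sketch (the $\\bar k$ values below are arbitrary): below $\\bar k = 1$ the iteration collapses to $S = 0$ (no giant component), and above it a nonzero solution appears."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from math import exp\n",
    "\n",
    "def giant_component_fraction(k_bar, iters=1000):\n",
    "    # iterate S <- 1 - exp(-k_bar * S), starting away from the\n",
    "    # trivial fixed point S = 0\n",
    "    S = 0.5\n",
    "    for _ in range(iters):\n",
    "        S = 1 - exp(-k_bar * S)\n",
    "    return S\n",
    "\n",
    "for k_bar in [0.5, 1.5, 3.0, 5.0]:\n",
    "    print(\"k = {}: S = {:.3f}\".format(k_bar, giant_component_fraction(k_bar)))"
   ]
  },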
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Use the same number of nodes for each example\n",
    "num_nodes = 500"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# list of the sizes of the largest components\n",
    "big_comp = []\n",
    "# number of nodes in the graph\n",
    "#num_nodes = 500\n",
    "# vector of edge probabilities\n",
    "p_values = [(1-x*.0001) for x in xrange(9850,10000)]\n",
    "# try it a few times to get a smoother curve\n",
    "iterations = 10\n",
    "for p in p_values:\n",
    "    size_comps = []\n",
    "    for h in xrange(0, iterations):\n",
    "        edge_list = []\n",
    "        for i in xrange(0,num_nodes):\n",
    "            for j in xrange(i,num_nodes):\n",
    "                if (random() < p):\n",
    "                    edge_list.append((i,j))\n",
    "        G = igraph.Graph(directed = False)\n",
    "        G.add_vertices(num_nodes)\n",
    "        G.add_edges(edge_list)\n",
    "        comps = [len(x) for x in G.clusters()]\n",
    "        size_comps.append(comps)\n",
    "    big_comp.append((sum([max(x) for x in size_comps])/len(size_comps)/float(num_nodes)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "plt.plot([x*(num_nodes-1) for x in p_values], big_comp, '.')\n",
    "plt.title(\"Phase transitions in connectedness\")\n",
    "plt.ylabel(\"Fraction of nodes in the largest component\")\n",
    "plt.xlabel(\"Average degree (k = p(n-1)), {} < p < {}\".format(p_values[99],p_values[0]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Clustering coefficient\n",
    "\n",
    "The clustering coefficient is a measure of how many trianges (completely connected triples) there are in a graph. You can think about it as the probability that if Alice knows Bob and Charlie, Bob also knows Charlie. The clustering coefficient of a graph is equal to $$ C = \\frac{\\text{(number of closed triples)}}{\\text{number of connected triples}} $$\n",
    "\n",
    "Finding the expected value of $C$ for a random graph is simple. For any 3 vertices, the probability that they are all connected is $p^3$ and the probability that at least 2 of them are connected is $p^2$. Then the expected values of closed triples (triangles) and connected triples respectiely are ${n \\choose 3}p^3 $ and ${n \\choose 3}p^2 $, and the expected value for $C$ is then $\\frac{p^3}{p^2} = p$. Notice in the above plot that the values for $p$ are very small, even when the graph is fully connected. In a randomly generated sparse graph (a graph where a small fraction of the total possible ${n \\choose 2}$ edges exist), the clustering coefficient $C$ is very low."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# vector of edge probabilities\n",
    "p_values_clustering = [x*.01 for x in xrange(0,100)]\n",
    "# try it a few times to get a smoother curve\n",
    "iterations = 1\n",
    "# store the clustering coefficient\n",
    "clustering = []\n",
    "for p in p_values_clustering:\n",
    "    size_comps = []\n",
    "    for h in xrange(0, iterations):\n",
    "        edge_list = []\n",
    "        for i in xrange(0,num_nodes):\n",
    "            for j in xrange(i,num_nodes):\n",
    "                if (random() < p):\n",
    "                    edge_list.append((i,j))\n",
    "        G = igraph.Graph(directed = False)\n",
    "        G.add_vertices(num_nodes)\n",
    "        G.add_edges(edge_list)\n",
    "        clustering.append((p, G.transitivity_undirected(mode=\"zero\")))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "plt.plot([x[0]*(num_nodes-1) for x in clustering], [x[1] for x in clustering], '.')\n",
    "plt.title(\"Clustering coeff vs avg degree in a random graph\")\n",
    "plt.ylabel(\"Clustering coefficient\")\n",
    "plt.xlabel(\"Average degree (k = (n-1)p), 0 < p < 1\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Small diameter graphs\n",
    "\n",
    "So we know that the giant component is very likely, even for sparse graphs, and also that the clustering coefficient is very low, even for relatively dense graphs. This means that the graph is almost completely connected, and that it is, at least locally, pretty similar to a tree graph (acyclic). \n",
    "\n",
    "Consider that a graph has a mean degree $\\bar k$. Now consider the number of vertices reachable from some vertex in the graph, $i$, call the number of vertices that $i$ can reach $l$. Because the clustering coefficient is very low (the graph is locally tree-like), it is likely that any neighbor of $i$'s has a completely new set of neightbors ($k$ neighbors, less $i$, $k-1$ total new neighbors). Then for each step, you reach $k-1$ new vertices. Thus the number of vertices reachable in $l$ steps from some vertex $i$ is $(k-1)^l$.\n",
    "\n",
    "The diameter of a graph is the maximum number of steps $l$ one would have to take to reach any vetex from any other vertex, or the number of steps needed to make any vertex reachable.\n",
    "$$ (k-1)^l = n $$\n",
    "$$ l = \\frac{1}{log(k-1)}log(n) \\approx O(log(n))$$\n",
    "\n",
    "Thus the diamater of the graph grows as $O(log(n))$, or shows \"small world\" characteristics."
   ]
  },
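  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A quick check of this estimate (a sketch; the values of $n$ and $\\bar k$ below are arbitrary): compare $l = \\log(n)/\\log(\\bar k-1)$ against distances measured on a sampled random graph. Since the estimate describes typical distances, it tracks the average shortest-path length more closely than the worst-case diameter."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import math\n",
    "import networkx as nx\n",
    "\n",
    "n, k_bar = 500, 5.0  # arbitrary example parameters\n",
    "p = k_bar / (n - 1)\n",
    "predicted = math.log(n) / math.log(k_bar - 1)\n",
    "G_er = nx.gnp_random_graph(n, p, seed=2)\n",
    "# measure distances on the largest component only\n",
    "giant = G_er.subgraph(max(nx.connected_components(G_er), key=len))\n",
    "print(\"predicted: {:.2f}  mean distance: {:.2f}  diameter: {}\".format(\n",
    "    predicted, nx.average_shortest_path_length(giant), nx.diameter(giant)))"
   ]
  },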
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# list of the average (over X iterations) diameters of the largest components\n",
    "diam = []\n",
    "# the degree distribution of the network for each average degree\n",
    "degrees = {}\n",
    "# vector of edge probabilities\n",
    "p_values = [(1-x*.0001) for x in xrange(9850,10000)]\n",
    "# try it a few times to get a smoother curve\n",
    "iterations = 10\n",
    "for p in p_values:\n",
    "    size_comps = []\n",
    "    diameters = []\n",
    "    for h in xrange(0, iterations):\n",
    "        edge_list = []\n",
    "        for i in xrange(0,num_nodes):\n",
    "            for j in xrange(i,num_nodes):\n",
    "                if (random() < p):\n",
    "                    edge_list.append((i,j))\n",
    "        G = igraph.Graph(directed = False)\n",
    "        G.add_vertices(num_nodes)\n",
    "        G.add_edges(edge_list)\n",
    "        diameters.append(G.diameter())\n",
    "        degrees[p*(num_nodes-1)] = G.degree()\n",
    "    diam.append(sum(diameters)/len(diameters))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "fig, ax1 = plt.subplots(figsize = (8,6))\n",
    "plt.title(\"Graph metrics vs avg degree in a random graph\", size = 16)\n",
    "ax1.plot([x*(num_nodes-1) for x in p_values], big_comp, 'o', color = \"red\", markersize=4)\n",
    "ax1.set_xlabel('Average degree (k = (n-1)p), 0 < p < 1', size = 16)\n",
    "\n",
    "ax1.set_ylim(0,1.01)\n",
    "ax1.set_xlim(0,6)\n",
    "# Make the y-axis label and tick labels match the line color.\n",
    "ax1.set_ylabel('Fraction of nodes in giant component', color='red', size = 16)\n",
    "ax1.grid(True)\n",
    "for tl in ax1.get_yticklabels():\n",
    "    tl.set_color('red')\n",
    "    tl.set_size(16)\n",
    "\n",
    "ax2 = ax1.twinx()\n",
    "ax2.set_xlim(0,6)\n",
    "ax2.plot([x*(num_nodes-1) for x in p_values], diam, 's', color = \"blue\", markersize=4)\n",
    "ax2.set_ylabel('Diameter of the giant component', color='blue', size = 16)\n",
    "for tl in ax2.get_yticklabels():\n",
    "    tl.set_color('blue')\n",
    "    tl.set_size(16)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "avg_degree_near_5 = min(degrees.keys(), key = lambda x: abs(x-5))\n",
    "xy = Counter(degrees[avg_degree_near_5]).items()\n",
    "plt.bar([x[0] for x in xy], [x[1] for x in xy], edgecolor = \"none\", color = \"blue\")\n",
    "plt.ylabel(\"# of nodes with degree X\", size = 16)\n",
    "plt.xlabel(\"Degree\", size = 16)\n",
    "plt.title(\"Degree distribution of the random graph\", size = 16)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## A comparison with a real social graph:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print(\"The number of nodes in the graph (all are connected): {}\".format(len(real_graph.nodes())))\n",
    "print(\"The number of edges in the graph: {}\".format(len(real_graph.edges())))\n",
    "print(\"The average degree: {}\".format(sum(nx.degree(real_graph).values())/len(real_graph.nodes())))\n",
    "print(\"The clustering coefficient: {}\".format(nx.average_clustering(real_graph)))\n",
    "print(\"The clustering coefficient that a random graph with the same degree would predict (k/(n-1)): {}\"\n",
    "      .format(sum(nx.degree(real_graph).values())/len(real_graph.nodes())/(len(real_graph.nodes())-1)))\n",
    "print(\"The diameter of the graph: {}\".format(nx.diameter(real_graph)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## The Configuration Model\n",
    "\n",
    "Another random graph model: the configuration model. Instead of generating our own degree sequence, we use a specified degree sequence (say, use the degree sequence of a social graph that we have) and change how the edges are connected. This allows us to ask the question: \"how much of this characteristic is completely explained by degree?\""
   ]
  },
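  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an aside, networkx ships a ready-made version of this, `nx.configuration_model`. It takes a degree sequence and returns a multigraph; self-loops and parallel edges then have to be discarded to get a simple graph, which is why the degree sequence is only approximately preserved. A sketch using the karate club degrees:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import networkx as nx\n",
    "\n",
    "real = nx.karate_club_graph()\n",
    "deg_seq = list(dict(nx.degree(real)).values())\n",
    "# configuration_model returns a multigraph that may contain\n",
    "# self-loops and parallel edges\n",
    "M = nx.configuration_model(deg_seq, seed=3)\n",
    "G_cfg = nx.Graph(M)  # collapse parallel edges\n",
    "G_cfg.remove_edges_from([(u, v) for u, v in G_cfg.edges() if u == v])\n",
    "print(\"real edges: {}  null-model edges: {}\".format(\n",
    "    real.number_of_edges(), G_cfg.number_of_edges()))"
   ]
  },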
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is an example of using the configuration model to create a null model of our \"real graph.\" Note that the algorithm that I am using works well for creating configuration models for large graphs, but produces more error on this smaller graph."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "A = []\n",
    "for v in real_graph.nodes():\n",
    "    for x in range(0, real_graph.degree(v)):\n",
    "        A.append(v)\n",
    "    shuffle(A)\n",
    "    # make the edge list\n",
    "    _E = [(A[2*x], A[2*x+1]) for x in range(0,int(len(A)/2))]\n",
    "    E = set([x for x in _E if x[0]!=x[1]])\n",
    "# add the edges to a new graph with the name node list\n",
    "C = real_graph.copy()\n",
    "C.remove_edges_from(real_graph.edges())\n",
    "C.add_edges_from(E)\n",
    "nx.draw(C, node_color = 'blue', pos = positions)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "print(\"The number of nodes in the graph (all are connected): {}\".format(len(C.nodes())))\n",
    "print(\"The number of edges in the graph: {}\".format(len(C.edges())))\n",
    "print(\"The average degree: {}\".format(sum(nx.degree(C).values())/len(C.nodes())))\n",
    "print(\"The clustering coefficient: {}\".format(nx.average_clustering(C)))\n",
    "print(\"The clustering coefficient that a random graph with the same degree would predict (k/(n-1)): {}\"\n",
    "      .format(sum(nx.degree(real_graph).values())/len(C.nodes())/(len(C.nodes())-1)))\n",
    "print(\"The diameter of the graph: {}\".format(nx.diameter(C)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Asking questions using a null model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A famous example of centrality measuring on a social network is the Florentine Families graph. Padgett's reseach on this graph claims that the Medicci family's rise to power can be explained by their high centrality on the graph of business interactions between families in Italy during that time. We will use a null model (configuration model) of the graph to rearrange how edges are places without altering any node's degree to discover how much of the Medicci's power is determined by thier degree (ranther than other structural components of the graph)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# get the graph\n",
    "florentine_families = igraph.Nexus.get(\"padgett\")[\"PADGB\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, let's show the relative rankings of the families with respect to vertex degree in the network and with respect to our chosen centrality measure, harmonic centrality. I won't go into various centrality measures here, beyond to say that harmonic centrality is formulated: \n",
    "$$ c_i = \\frac{1}{n-1}\\sum_{i,i\\neq j}^{n-1}\\frac{1}{d_{ij}} $$\n",
    "where $d_{ij}$ is the geodesic distance between vertices $i$ and $j$. Basically, harmonic centrality is a measure of how close a vertes is to every other vertex."
   ]
  },
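  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the formula concrete, here is a small hand-rolled version (a sketch using networkx rather than igraph). On a 3-vertex path, the middle vertex is at distance 1 from both endpoints, so it gets the maximum score of 1."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import networkx as nx\n",
    "\n",
    "def harmonic_centrality(G):\n",
    "    # normalized harmonic centrality: mean of 1/d_ij over j != i\n",
    "    n = G.number_of_nodes()\n",
    "    cent = {}\n",
    "    for i in G.nodes():\n",
    "        dists = dict(nx.single_source_shortest_path_length(G, i))\n",
    "        cent[i] = sum(1.0 / d for j, d in dists.items() if j != i) / (n - 1)\n",
    "    return cent\n",
    "\n",
    "print(harmonic_centrality(nx.path_graph(3)))"
   ]
  },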
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# degree centrality\n",
    "d = florentine_families.degree()\n",
    "d_rank = [(x, florentine_families.vs[x]['name'], d[x]) for x in range(0,len(florentine_families.vs()))]\n",
    "d_rank.sort(key = itemgetter(2), reverse = True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# harmonic centrality\n",
    "distances = florentine_families.shortest_paths_dijkstra()\n",
    "h = [sum([1/x for x in dist if x != 0])/(len(distances)-1) for dist in distances]\n",
    "h_rank = [(x, florentine_families.vs[x]['name'], h[x]) for x in range(0,len(florentine_families.vs()))]\n",
    "h_rank.sort(key = itemgetter(2), reverse = True)\n",
    "# make the table\n",
    "d_table = []\n",
    "d_table.append([\"Rank (by degree)\", \"degree\", \"Rank (h centrality)\", \"harmonic\"])\n",
    "for n in xrange(0,len(florentine_families.vs())):\n",
    "    table_row = []\n",
    "    table_row.extend([d_rank[n][1], str(d_rank[n][2])[0:5]])\n",
    "    table_row.extend([h_rank[n][1], str(h_rank[n][2])[0:5]])\n",
    "    #table_row.extend([e_rank[n][1], str(e_rank[n][2])[0:5]])\n",
    "    #table_row.extend([b_rank[n][1], str(b_rank[n][2])[0:5]])\n",
    "    d_table.append(table_row)\n",
    "print tabulate(d_table)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now the fun (?) part. Create a bunch of different random configuration models based on the florentine families graph, then measure the harmonic centrality on those graphs. The harmonic centality of a node on the null model will deend only on its degree (as the graph structure is now ranom)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "config_model_centrality = [[] for x in florentine_families.vs()]\n",
    "config_model_means = []\n",
    "hc_differences = [[] for x in range(0,16)]\n",
    "for i in xrange(0,1000):\n",
    "    # build a random graph based on the configuration model\n",
    "    C = florentine_families.copy()\n",
    "    # graph with the same edge list as G\n",
    "    C.delete_edges(None)\n",
    "    # print C.summary()\n",
    "    # Add random edges\n",
    "    # vertex list A\n",
    "    A = []\n",
    "    for v in florentine_families.vs().indices:\n",
    "        for x in range(0,florentine_families.degree(v)):\n",
    "            A.append(v)\n",
    "    shuffle(A)\n",
    "    # print A\n",
    "    # make the edge list\n",
    "    _E = [(A[2*x], A[2*x+1]) for x in range(0,int(len(A)/2))]\n",
    "    E = set([x for x in _E if x[0]!=x[1]])\n",
    "    # add the edges to C\n",
    "    # print E\n",
    "    C.add_edges(E)\n",
    "\n",
    "    # rank the vertices by harmonic centrality\n",
    "    C_distances = C.shortest_paths_dijkstra()\n",
    "    C_h = [sum([1/x for x in dist if x != 0])/(len(C_distances)-1) for dist in C_distances]\n",
    "    del C\n",
    "    for vertex in range(0,16):\n",
    "        hc_differences[vertex].append(h[vertex] - C_h[vertex])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "plt.plot([percentile(diff, 50) for diff in hc_differences], '--')\n",
    "plt.plot([percentile(diff, 25) for diff in hc_differences], 'r--')\n",
    "plt.plot([percentile(diff, 75) for diff in hc_differences], 'g--')\n",
    "\n",
    "plt.xticks(range(0,16))\n",
    "plt.gca().set_xticklabels(florentine_families.vs()['name'])\n",
    "plt.xticks(rotation = 90)\n",
    "\n",
    "plt.gca().grid(True)\n",
    "\n",
    "plt.ylabel(\"(centrality) - (centrality on the null model)\")\n",
    "plt.title(\"How much of harmonic centrality is explained by degree?\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The dashed red, blue and green lines repectively represent the 25th, 50th and 75th percentile of the difference in centralities between the real graph and the pool of null models. We can see that for the Mediccis, that difference is basically always high--their centrality on the real graph is higher than their centrality on this null model. This shows that there is in fact something important about their place structurally in the graph, as well as thier high degree."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## That's all folks!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
